We propose a simple real-valued generalization of the well-known integer-valued Erdos number
as a topological, non-metric measure of the 'closeness' felt between two nodes in an undirected,
weighted graph. These real-valued Erdos numbers are asymmetric and can distinguish between
network topologies that standard distance metrics view as identical. We use this measure to
study some simple, analytically tractable networks. To show its utility, we devise a ratings
scheme based on the generalized Erdos number, deploy it on data from the NetFlix Prize, and
find a significant improvement in our ratings prediction over a baseline.



A variety of complex natural and artificial systems can be viewed as a network [1], with a set
of nodes representing objects and a set of edges connecting these nodes representing
interactions between objects. Such systems include protein [2] or metabolic [3, 4] networks,
computer networks and the world wide web [5, 6], disease propagation in populations [7, 8], and
networks of human [7, 9] or other animal [10, 11] interactions. While much of the study of
networks involves characterizing both their internal structure [3, 8, 9] and the propagation
of dynamical processes on them [1, 12], a basic question that continues to be of interest is
that of characterizing closeness or connectedness in such networks. Various measures of the
distance between nodes have been developed, including the integer distance [13] (identical to
the classic Erdos numbers [14], which measure the authorship distance to the famous Hungarian
mathematician) and resistance-distance approaches [10, 15], which often have both a geometric
and a topological character. In this paper, we develop a framework for determining the
'closeness' between nodes in a weighted network by developing a generalized real-valued Erdos
number, an inherently topological entity that incorporates nonlocal information about
connectivity and is asymmetric, i.e. $E_{ij} \neq E_{ji}$ even if the underlying adjacency
matrix is symmetric. Using analytically tractable symmetric networks, we show that these Erdos
numbers can distinguish between topologies that are identical when viewed through the lens of
common distance metrics [13, 15]. In order to show that these Erdos numbers have utility in
making quantitative predictions about real-world networks, we also develop a basic predictor
for a small subset of the NetFlix data [16], and find significant improvement over a baseline
prediction.

In order to develop a natural measure of the 'closeness' between two nodes, we consider one of
the simplest possible networks: a linear network of exactly three nodes (diagrammed in
Fig. 1(a)). With Erdos indexed as 0, we define his closeness to himself as $E_{00} = 0$, as is
the case in all distance metrics [13, 15].
FIG. 1: (a) A simple linear author network. Bob has published only with Alice, so
$E_{0B} = E_{0A} + w_{AB}^{-1}$. The competition between Alice's connections with Erdos and
with Bob leads to $E_{0A}$ given in (1). (b) Smaller sizes denote closeness to Erdos (large
spheres implying large $E_{0i}$). Red coloring denotes strong interactions, while blue denotes
weak interactions. $E_{0A}$ increases as $w_{AB}$ increases. (c) A simple cycle, for two
authors weakly connected to Erdos, but strongly connected to one another.

For a node B (Bob, say) directly connected to exactly one other node (Alice in this case), we
define the closeness felt by Bob towards Erdos as $E_{0B} = E_{0A} + w_{AB}^{-1}$, with
$w_{AB}$ the weight of the edge joining Alice and Bob. The determination of $E_{0A}$ is more
ambiguous, since Alice is connected to two nodes. If we were to use a simple integer [13, 14]
or resistance-distance measure [15], the distance between Erdos and Alice would be
$R_{0A} = w_{0A}^{-1}$, whereas one would expect that a realistic measure of closeness would
depend on all of the nodes to which Alice is connected.

To incorporate the effect of multiple connections between Alice and the other nodes, we assume
that the closeness Alice feels to Erdos is a function of the closeness felt by all other nodes
connected to Alice towards Erdos. In particular, we expect
$E_{0A} = f(\{E_{0i} + w_{Ai}^{-1}\})$, where $E_{0i} + w_{Ai}^{-1}$ would be the closeness
Alice feels to Erdos in the absence of all other interactions. We expect that the unknown
functional form of $f$ should (1) penalize large values of $E_{0i} + w_{Ai}^{-1}$ (i.e. nodes
that feel close to Erdos contribute more than nodes that feel far from Erdos when computing
$E_{0A}$), and (2) give nodes connected with high weight a larger contribution than those
connected with low weight. These expectations are diagrammed schematically in Fig. 1(b), and
suggest the use of a weighted harmonic mean of the form

$$\frac{1}{E_{0A}} = \frac{w_{A0}}{w_{A0}+w_{AB}}\,\frac{1}{E_{00}+w_{A0}^{-1}} + \frac{w_{AB}}{w_{A0}+w_{AB}}\,\frac{1}{E_{0B}+w_{AB}^{-1}}, \qquad (1)$$

where the necessity of using the scaled weight $w_{Ai}/(w_{A0}+w_{AB})$ will be addressed
below. We note that although (1) is the simplest and most natural functional form that
satisfies the constraints above, other forms are certainly possible. Furthermore, the
centrality of Erdos in any network may clearly be replaced by that of any node $i$, so that we
can generalize (1) to define the closeness felt by node $j$ towards node $i$ as

$$\frac{1}{E_{ij}} = \sum_{l \in C_j} \frac{w_{jl}}{d_j}\,\frac{1}{E_{il}+w_{jl}^{-1}}, \qquad (2)$$

where $w_{jl}$ is the weight of the edge between $j$ and $l$, $C_j$ is the set of nodes directly
connected to node $j$, and $d_j = \sum_l w_{jl}$ is the weighted degree of node $j$. The reason
for the scaled weights $w_{jl}/d_j$ becomes clearer in (2): unscaled weights $w_{jl}$ would
imply that node $j$ would have a low Erdos number $E_{ij}$ (i.e. feel very close to node $i$)
simply by having many connections (large $d_j$), even if these connections led to nodes with
high Erdos numbers $E_{il}$. We note that if $w_{jl} = \epsilon\,\delta_{l l_0}$ for some
$l_0$, then $E_{ij} \sim \epsilon^{-1} \to \infty$ as $\epsilon \to 0$, so a node with
vanishing weight for its only connection to the network will have its Erdos numbers diverge as
it becomes 'disconnected'.
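
The self-consistent definition above can be explored numerically. The following is a minimal sketch, not the authors' implementation: it assumes the weighted-harmonic-mean form $1/E_{ij} = \sum_l (w_{jl}/d_j)/(E_{il} + w_{jl}^{-1})$ with $E_{ii} = 0$, a connected graph given as a symmetric weight matrix, and a plain fixed-point iteration started from $E = 0$ (the function name `generalized_erdos`, the starting point, and the tolerance are illustrative choices, and convergence is assumed rather than proven).

```python
def generalized_erdos(W, tol=1e-10, max_iter=100_000):
    """Return E with E[i][j] = closeness felt by node j towards node i.

    W is a symmetric list-of-lists of non-negative edge weights with a zero
    diagonal. The graph is assumed to be connected (no isolated nodes).
    """
    n = len(W)
    degree = [sum(row) for row in W]          # weighted degree d_j
    E = [[0.0] * n for _ in range(n)]         # assumed starting guess E = 0
    for _ in range(max_iter):
        E_new = [[0.0] * n for _ in range(n)]
        change = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue                  # E_ii = 0 by definition
                # sum over neighbours l of j of w_jl / (E_il + 1/w_jl)
                s = sum(W[j][l] / (E[i][l] + 1.0 / W[j][l])
                        for l in range(n) if W[j][l] > 0)
                E_new[i][j] = degree[j] / s
                change = max(change, abs(E_new[i][j] - E[i][j]))
        E = E_new
        if change < tol:
            return E
    raise RuntimeError("fixed-point iteration did not converge")


# Three-node chain Erdos(0) - Alice(1) - Bob(2) with unit weights.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
E = generalized_erdos(chain)
print("E_0A =", E[0][1], " E_0B =", E[0][2], " E_A0 =", E[1][0])
```

In this toy chain Bob's value exceeds Alice's by exactly the inverse edge weight, and the value Erdos feels towards Alice differs from the one Alice feels towards Erdos, which illustrates the asymmetry discussed in the text.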

To illuminate aspects of the generalized Erdos numbers, we first consider a simple three-node
cycle, shown in Fig. 1(c), where two nodes strongly connected to each other with weight
$w_{AB} = w \gg 1$ are weakly connected to a third node (indexed 0). Solving (2) for the Erdos
numbers $E_{ij}$ of this simple network yields
$E_{0A} = E_{0B} \approx (1+\sqrt{5})/2 + O(w^{-1})$ for large $w$, showing that the two nodes
move away from the third as their connection strengthens (note that $E_{0A} = \sqrt{2}$ for
$w = 1$). Nodes A and B move towards each other as $w$ increases, as can be seen by computing
$E_{AB} = E_{BA} \approx w^{-1} + O(w^{-2})$. The third node has a low degree and is closer to
the other nodes than they are to it ($E_{A0} = E_{B0} \approx 1 + O(w^{-1})$). Fig. 1(c) thus
displays the inherent asymmetry in the Erdos numbers ($E_{A0} \neq E_{0A}$), indicating that
$E_{ij}$ is not a distance metric, but rather an inherently topological measure.
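
The large-$w$ limit quoted for the three-node cycle can be checked in closed form. The worked steps below are a sketch under two assumptions that the text does not state explicitly: the two weak edges have unit weight ($w_{0A} = w_{0B} = 1$), and the closeness obeys the weighted harmonic mean $1/E_{ij} = \sum_l (w_{jl}/d_j)/(E_{il} + w_{jl}^{-1})$ with $E_{00} = 0$. By symmetry we write $E \equiv E_{0A} = E_{0B}$.

```latex
\begin{align*}
\frac{1}{E} &= \frac{1}{1+w}\left[\frac{1}{0 + 1} + \frac{w}{E + w^{-1}}\right], \\
0 &= w E^{2} - (w-1)\,E - (w+1), \\
E &= \frac{(w-1) + \sqrt{5w^{2} + 2w + 1}}{2w}
  \;\longrightarrow\; \frac{1+\sqrt{5}}{2} \qquad (w \to \infty).
\end{align*}
```

Under these assumptions the limiting value is the golden ratio, and expanding the square root shows that it is approached from below, with a correction of order $w^{-1}$.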

FIG. 2: The Erdos numbers computed for the open (filled symbols) and closed (open symbols)
linear networks, along with the theoretical scaling $E_{0i} = i(i+4)/3$ (solid red line).
Insets schematically diagram the open and closed linear networks, as well as a tree network
with $m = 3$ connections per node and length $L = 3$ (discussed further in the text).

The Erdos numbers differ from common distance metrics in a number of ways, as can be seen by
examining some more complex networks. In a fully connected network of $N+1$ nodes of constant
interaction strength $w_{ij} = w(1-\delta_{ij})$, each node has an identical Erdos number
($E_{0i} = E$ for $i \neq 0$), with the equivalent of (2) given by

$$\frac{1}{E} = \frac{1}{N}\left[\, w + \frac{N-1}{E + w^{-1}} \right], \qquad (3)$$

yielding $E = \sqrt{N}/w$. As the strength $w$ of each connection between nodes increases, the
Erdos numbers of all nodes decrease, since all nodes become closer to each other as well as to
Erdos. However, as the number of nodes increases, with edges added to keep the network fully
connected, the importance of an individual edge is lessened and all nodes will feel less close
to one another. This is in contrast with other measures such as the resistance distance, which
decreases as new nodes are added, or the integer distance, which remains constant independent
of $N$.
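
The result $E = \sqrt{N}/w$ follows in a few lines. The worked steps below assume that each node $j \neq 0$ has weighted degree $d_j = Nw$ and that the self-consistency condition reads $N/E = w + (N-1)/(E + w^{-1})$, which is what the weighted harmonic mean gives when one neighbour is Erdos (with $E_{00} = 0$) and the remaining $N-1$ neighbours all share the same value $E$. The substitution $x \equiv wE$ is introduced here only for convenience.

```latex
\begin{align*}
\frac{N}{E} &= w + \frac{N-1}{E + w^{-1}}
  \;\;\Longrightarrow\;\;
  \frac{N}{x} = 1 + \frac{N-1}{x+1}, \qquad x \equiv wE, \\
N(x+1) &= x(x+1) + (N-1)\,x = x^{2} + Nx
  \;\;\Longrightarrow\;\;
  x^{2} = N, \qquad E = \frac{\sqrt{N}}{w}.
\end{align*}
```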

We next consider generalizations of the simple networks (Fig. 1) to extended linear networks
and a cycle-free tree (Fig. 2), where each node is connected to exactly $m$ nodes, except for
the endpoints. For the open networks with $m \geq 2$, the resistance distance between any two
points is $R_{ij} = |i-j|$, since there are no cycles, while the generalized Erdos numbers
between a node $i$ and the base of a branch satisfy

$$\frac{m}{E_{0i}} = \frac{1}{E_{0,i-1}+w^{-1}} + \frac{m-1}{E_{0,i+1}+w^{-1}}, \qquad (4)$$

with the boundary conditions $E_{00} = 0$ and $E_{0L} = E_{0,L-1} + w^{-1}$. The closed linear
network can be studied using the same difference equation, with the boundary conditions
$E_{00} = E_{0,L+1} = 0$ after insertion of a virtual node. Since
$E_{0i}(w) = E_{0i}(w=1)/w$ for constant interaction strength $w$, the weights can be factored
out and are ignored below. While the difference equations are not exactly solvable in general,
for $m = 2$ we can see that $E_{0i} = i(i+4)/3$ is a solution that satisfies (4) and the
boundary condition $E_{00} = 0$. For $i \approx L$, deviations from this predicted scaling are
expected to occur due to the boundary condition at the distant ends. Interestingly, the
quadratic scaling $E_{0i} \sim i^2$ for distant nodes matches the time for particle diffusion
from node 0 to $i$, which takes a time $\tau \sim i^2$. For tree networks with large $m$ (inset
of Fig. 2), we find that $E_{0i} = E_{0,i-1} + (m-1)^i$ asymptotically satisfies the
difference equation with the boundary condition $E_{00} = 0$. The tree network produces an
exponential growth with $i$ for large $m$, rather than the quadratic growth seen for $m = 2$,
clearly showing that the Erdos numbers are able to distinguish between the global topologies
of these very different networks more accurately than a resistance distance approach.
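
Both closed-form statements in this paragraph can be checked with exact rational arithmetic. The sketch below assumes unit weights and the difference equation $m/E_{0i} = 1/(E_{0,i-1}+1) + (m-1)/(E_{0,i+1}+1)$; the helper names (`chain_ansatz`, `tree_ansatz`, `relative_residual`) are illustrative and do not come from the paper.

```python
from fractions import Fraction


def chain_ansatz(i):
    """Quadratic ansatz E_0i = i(i+4)/3 for the linear network (m = 2)."""
    return Fraction(i * (i + 4), 3)


def tree_ansatz(i, m):
    """Ansatz E_0i = E_0,i-1 + (m-1)^i with E_00 = 0, i.e. a geometric sum."""
    return sum((Fraction((m - 1) ** k) for k in range(1, i + 1)), Fraction(0))


def relative_residual(E, i, m):
    """(lhs - rhs) / lhs for m/E_i = 1/(E_{i-1}+1) + (m-1)/(E_{i+1}+1)."""
    lhs = Fraction(m) / E(i)
    rhs = 1 / (E(i - 1) + 1) + Fraction(m - 1) / (E(i + 1) + 1)
    return (lhs - rhs) / lhs


# The quadratic form satisfies the m = 2 equation exactly at interior nodes.
print([relative_residual(chain_ansatz, i, 2) for i in range(1, 6)])

# The tree ansatz leaves a residual that shrinks as m grows (here at i = 3).
for m in (3, 10, 100):
    r = relative_residual(lambda i: tree_ansatz(i, m), 3, m)
    print(m, float(r))
```

Using `Fraction` avoids any floating-point ambiguity: the quadratic form yields a residual of exactly zero, whereas the tree form is only an approximation whose relative error decreases with increasing $m$.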

In order to determine the numerical values of the Erdos numbers for this linear network (with
$w = 1$), we determine an iterative solution for $E_{0i}$, with
$2/E_{0i}^{(t)} = (E_{0,i-1}^{(t-1)} + 1)^{-1} + (E_{0,i+1}^{(t-1)} + 1)^{-1}$ and
$E_{0i}^{(0)} = i(i+4)/3$. $E_{0i}^{(t)}$ is computed until
$\epsilon(t) = \max_i |E_{0i}^{(t)} - E_{0i}^{(t-1)}| < 0.01$. The resulting numerical
solutions for the Erdos numbers are shown in Fig. 2, with the solid red line denoting the
predicted quadratic growth, $E_{0i} = i(i+4)/3$. The predicted scaling agrees well with the
numerical results [17], with deviations occurring near the $i = N$ endpoint for the open
network and near the $i = N/2$ midpoint for the closed network.
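
The iteration described above is straightforward to sketch for the open chain. The code below is an illustration under stated assumptions, not the authors' program: nodes are labelled $0, \dots, L$, interior nodes are updated from the previous iterate, and the open end is set from its freshly updated neighbour so that $E_{0L} = E_{0,L-1} + 1$ holds exactly at every step (the text does not specify how the boundary is updated). The name `iterate_open_chain` and the iteration cap are hypothetical.

```python
def iterate_open_chain(L, tol=0.01, max_iter=100_000):
    """Iterate 2/E_i = 1/(E_{i-1}+1) + 1/(E_{i+1}+1) on nodes 0..L (w = 1).

    Returns the final list of values and the number of sweeps performed.
    """
    # Initial guess E^(0)_i = i(i+4)/3, which already satisfies E_0 = 0.
    E = [i * (i + 4) / 3 for i in range(L + 1)]
    for step in range(1, max_iter + 1):
        new = E[:]                              # new[0] stays fixed at 0
        for i in range(1, L):
            new[i] = 2.0 / (1.0 / (E[i - 1] + 1.0) + 1.0 / (E[i + 1] + 1.0))
        new[L] = new[L - 1] + 1.0               # assumed open-end boundary update
        eps = max(abs(a - b) for a, b in zip(new, E))
        E = new
        if eps < tol:
            return E, step
    raise RuntimeError("iteration did not reach the requested tolerance")


values, sweeps = iterate_open_chain(20)
print("sweeps:", sweeps)
for i in (1, 5, 10, 20):
    print(i, round(values[i], 3), round(i * (i + 4) / 3, 3))
```

Because the initial guess satisfies the interior equation exactly, the first sweep changes only the end node, and later sweeps carry that boundary correction inward from the open end.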

To see if the Erdos numbers can make quantitative predictions about real-world networks, we
consider the data provided for the NetFlix Prize [16], a competition to improve algorithms for
the prediction of movie ratings. Here, we use the generalized Erdos numbers as a means to
characterize an interaction 'energy' between nodes when predicting the rating user $i$ gives
to movie $l$, $p_i^{(l)}$, using the Boltzmann weighted average taken from statistical
mechanics,

$$p_i^{(l)} = \sum_{j \in S_l} r_j^{(l)}\, e^{-\beta E_{ij}} \Big/ \sum_{j \in S_l} e^{-\beta E_{ij}}. \qquad (5)$$

$\beta$ is a free parameter (an inverse temperature) describing how important distant nodes
are in determining the predicted rating. The product $\beta E_{ij}$ determines which nodes are
important to the average and which are not, and assigns a lower weight to the latter. In order
to compute the Erdos numbers in (5), we need to generate a weighted graph using the NetFlix
data.
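
The role of $\beta$ in the Boltzmann weighted average can be seen in a few lines of code. This is a minimal sketch that assumes the ratings and the corresponding (finite) Erdos numbers of the other users who rated the movie are already available as two parallel lists; the function name `boltzmann_prediction` and the toy numbers are illustrative only.

```python
import math


def boltzmann_prediction(ratings, erdos, beta):
    """Boltzmann-weighted average of ratings with weights exp(-beta * E).

    ratings[k] is the rating given by the k-th other user and erdos[k] is the
    (finite) Erdos number between the target user and that user.
    """
    # Shifting by the minimum leaves the ratio unchanged and avoids underflow.
    e_min = min(erdos)
    weights = [math.exp(-beta * (e - e_min)) for e in erdos]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)


# Two other users: a close one who rated 5 and a distant one who rated 1.
for beta in (0.0, 0.5, 2.0):
    print(beta, boltzmann_prediction([5, 1], [1.0, 3.0], beta))
```

At $\beta = 0$ the estimate reduces to the unweighted mean of the ratings, while larger $\beta$ shifts it towards the ratings of the users with the smallest Erdos numbers.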
case   N      k      $\alpha$   num. with $n > 30$   improvement at $\beta = 2$ ($n_{\min} = 30$)
1      3000   553    2          304 (55%)            3.56%
2      3000   557    4          302 (54%)            4.71%
3      3000   1297   8          782 (60%)            3.35%
4      6000   368    8          188 (51%)            4.53%

TABLE I: Parameters used in the NetFlix analysis. $N$ is the number of users in the dataset,
and $k$ is the number of users for whom predictions were made. The number of nodes with
$n_i > n_{\min}$ out of the $k$ considered are shown, as well as the average percent
improvement for these nodes.




FIG. 3: Percent improvement at $\beta = 2$ compared to $\beta = 0$ as a function of
$n_{\min}$. See Table I for parameters. Case 1 is shown as open circles, case 2 as filled
circles, case 3 as open squares, and case 4 as filled squares. Error bars (using the standard
deviation of the mean) are shown only for case 4, with the errors for the other cases being
smaller. The upper inset shows $\langle\rho(\beta)\rangle$ as a function of $\beta$ for
varying $n_{\min}$ (higher curves correspond to smaller $n_{\min}$). The lower inset shows the
fraction of users satisfying $n > n_{\min}$.

While we could represent the NetFlix data as a bipartite network [18], where the users and
movies form sets of disjoint nodes, we instead use the movie ratings (an integer between 1 and
5) to determine a weight between two users, using the simple power law form

$$w_{ij} = \sum_{l \in M_{ij}} \left(5 - |\Delta r_{ij}^{(l)}|\right)^{\alpha}, \qquad (6)$$

with $\Delta r_{ij}^{(l)} = r_i^{(l)} - r_j^{(l)}$, $r_i^{(l)}$ the rating user $i$ gave to
movie $l$ ($0 \leq |\Delta r_{ij}^{(l)}| \leq 4$), and $M_{ij}$ the set of movies that both
users $i$ and $j$ have rated ($w_{ij} = 0$ if $i$ and $j$ have rated no movies in common). If
users $i$ and $j$ disagree maximally on all movies (i.e. one rates a 5 while the other rates a
1), the weight between them is $w_{ij} = |M_{ij}|$, while perfect agreement gives a weight
$w_{ij} = 5^{\alpha} \times |M_{ij}|$. Implicit in this definition is that users who seek out
the same movies have more similar tastes than those who do not (even if they do not agree),
and that users who agree on movies are more likely to have similar tastes than those who
disagree. The free parameter $\alpha$ determines the importance of agreement, with
$\alpha = 0$ implying that disagreement in the ratings is irrelevant, while agreement becomes
dominant as $\alpha \to \infty$.
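
The weight between two users is simple to compute once each user's ratings are stored as a mapping from movie to rating. The sketch below assumes the power-law form $w_{ij} = \sum_{l \in M_{ij}} (5 - |r_i^{(l)} - r_j^{(l)}|)^{\alpha}$ with integer ratings between 1 and 5; the dictionary layout and the name `user_weight` are illustrative choices rather than anything specified in the text.

```python
def user_weight(ratings_i, ratings_j, alpha):
    """Weight between two users from the movies both of them have rated.

    ratings_i and ratings_j map a movie identifier to an integer rating in 1..5.
    Returns 0 when the users share no movies.
    """
    common = ratings_i.keys() & ratings_j.keys()      # the set M_ij
    return sum((5 - abs(ratings_i[m] - ratings_j[m])) ** alpha for m in common)


alice = {"m1": 5, "m2": 1, "m3": 4}
bob = {"m1": 1, "m2": 5, "m4": 3}
carol = {"m1": 5, "m2": 1, "m3": 4}

print(user_weight(alice, bob, alpha=2))    # maximal disagreement on two shared movies
print(user_weight(alice, carol, alpha=2))  # perfect agreement on three shared movies
```

The two limiting cases described in the text correspond to a weight equal to the number of shared movies (maximal disagreement) and to $5^{\alpha}$ times that number (perfect agreement).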

To test our prediction scheme, we select a subset of the dataset comprising $N$ users and 6000
movies (the parameters are listed in Table I). For varying values of $N$ and $\alpha$, we
choose $k$ users from the data set in order to test the efficacy of our approach ($k$ is shown
in the third column of Table I). For each node $i$ selected, we iteratively perform the
following steps for each movie $l$ user $i$ has seen: (I) remove the rating user $i$ gave to
movie $l$ from the network, (II) compute the Erdos numbers for this modified network using
(2), and (III) compute the predicted rating user $i$ gives to movie $l$ using (5) as a
function of $\beta$, $p_i^{(l)}(\beta)$. The average improvement as a function of $\beta$ is
determined from the RMSD $\rho_i^2(\beta) = \sum_l [r_i^{(l)} - p_i^{(l)}(\beta)]^2 / n_i$,
where $n_i$ is the number of movies that user $i$ has seen.
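
Steps (I) to (III) amount to a leave-one-out loop over the movies a user has rated. The sketch below shows only that control flow and makes several assumptions: ratings are stored as one dictionary per user, the averaging set for a movie is taken to be all other users who rated it, and the two callables `erdos_from_ratings` (ratings to a matrix `E[i][j]`) and `predict` (ratings and Erdos numbers to a predicted rating) are hypothetical stand-ins for the weight construction, the Erdos-number solver, and the weighted average.

```python
def leave_one_out(ratings, user, erdos_from_ratings, predict):
    """Predict each rating of `user` after hiding it from the network.

    ratings: list of {movie: rating} dictionaries, one per user.
    erdos_from_ratings: hypothetical callable, ratings -> matrix E[i][j].
    predict: hypothetical callable, (ratings_of_others, erdos_of_others) -> float.
    """
    predictions = {}
    for movie in ratings[user]:
        reduced = [dict(r) for r in ratings]       # work on a copy of the data
        del reduced[user][movie]                   # (I) remove the rating
        E = erdos_from_ratings(reduced)            # (II) recompute Erdos numbers
        # Assumed averaging set: every other user who rated this movie.
        others = [j for j, r in enumerate(ratings) if j != user and movie in r]
        if not others:
            continue                               # nobody else rated this movie
        predictions[movie] = predict(              # (III) predicted rating
            [ratings[j][movie] for j in others],
            [E[user][j] for j in others],
        )
    return predictions


# Toy run with stand-in callables: a fixed matrix and an unweighted mean.
toy = [{"a": 5, "b": 4}, {"a": 5, "b": 5}, {"a": 1}]
fixed_E = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
print(leave_one_out(toy, 0, lambda r: fixed_E, lambda rs, es: sum(rs) / len(rs)))
```

Passing the solver and the predictor as arguments keeps the loop independent of how the weights and Erdos numbers are actually computed, which is by far the most expensive part since step (II) is repeated for every hidden rating.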

The RMSD $\rho_i$ depends strongly on the number of movies ($n_i$) that the user has seen, as
can be seen by computing the average RMSD restricted to users with $n_i > n_{\min}$. In the
upper inset of Fig. 3, a pronounced minimum in $\langle\rho(\beta)\rangle$ develops for
increasing $n_{\min}$. The relative improvement of (5) over an unweighted average,
$\Delta\rho = 1 - \langle\rho(\beta)\rangle / \langle\rho(0)\rangle$, is significant for
$n_{\min} > 30$, as seen in the main panel of Fig. 3. Restricting ourselves to users with
$n_i > 30$ ratings gives an improvement of roughly 3-5% at $\beta = 2$ for all values of
$\alpha$ and $k$ examined (over 50% of the nodes are included in the average, see Table I).
For very well connected nodes (with $n_{\min} = 200$, or about 8% of the nodes in each case,
see the lower inset of Fig. 3) the average improvement is larger, ranging from 4.5-9.5%. The
dependence of the improvement on $n_{\min}$ is somewhat unsurprising, as the preferences of
users who have seen very few movies will be much more difficult to predict. We also note that
the negative improvement for small $n_{\min}$ is due to the fact that the positions of the
minimum in $\langle\rho(\beta)\rangle$ saturate at $\beta = 2$ for large $n_{\min}$, but are
far from this value for small $n_{\min}$.
Our minimal definition of the generalized Erdos number, which arises from an asymmetric
measure of 'closeness', takes the global topology of the network into account. We have shown
that it can be used to characterize connectivity on simple analytically tractable networks,
and that it can serve as the basis for a ranking scheme for data sets from the NetFlix Prize,
where it outperforms baseline schemes. The weighted average in (5) can be implemented in other
prediction schemes, and a more complex form for the weighting between nodes (incorporating
temporal information, for example) may give further improvements in predictions. A natural
next step for any measure of connectedness is to use it in additional applications: problems
associated with community detection in graphs, as well as the dynamics of diffusion, epidemics
and the behavior of dynamic networks with time-dependent edge weights, beckon.